Our audience comprises the acquisition team of IBM (International Business Machines Corporation), a global technology conglomerate renowned for its diverse hiring practices and varied workforce productivity. Central to their concerns is the challenge of worker attrition. Our objective is to provide guidance that ensures the recruitment of high-quality candidates with a diminished likelihood of turnover. By optimizing hiring practices in this manner, we aim to enhance profitability in both the short and long term for IBM.
Our paramount concern is addressing how we can optimize both the
caliber and overall productivity of our workforce, while safeguarding
against premature turnover. At the core of this inquiry lies the
imperative to maximize the profitability derived from each hire,
considering that recruiting individuals who swiftly depart leads to
negative returns on investment, particularly in terms of training costs.
Hence, it is imperative that we meticulously screen candidates to avert scenarios where early attrition undermines our bottom line.
IBM’s workforce spans across various fields, yet recent trends reveal
a concerning pattern of premature departures, resulting in significant
profit setbacks. The root of the issue lies in the acquisition team’s
emphasis on short-term productivity, overlooking the crucial factor of
employee retention.
Similar to university admissions strategies, where institutions like
the University of Michigan might forgo exceptional applicants due to
concerns about their commitment, IBM must adopt a nuanced approach. We
propose a strategy that balances maximizing immediate output with the
long-term goal of retaining top talent.
Our solution involves screening candidates not only for their
potential contributions to IBM but also for their propensity to remain
with the company. By prioritizing individuals who demonstrate both high
potential and a commitment to long-term engagement, we mitigate the risk
of negative profit margins associated with frequent turnover.
In essence, our tailored models enable IBM to navigate the seemingly
paradoxical challenge of selecting candidates who are both highly
productive and likely to stay. By investing in employees who align with
the company’s long-term vision, IBM can secure greater profitability in
the years ahead.
In order to quantify the impact of our work, we have assigned
monetary values to our final outcomes, streamlining worker productivity
into a comprehensible variable. This variable encompasses several
factors: tenure, job level, engagement, overtime commitment, performance
ratings, and compensation, each meticulously calibrated to accurately
reflect worker quality. Furthermore, we’ve quantified the financial
impact of attrition, ensuring clarity and ease of interpretation for our
results.
When our model is applied to test data, its efficacy can be scrutinized and effectively communicated. Transparency is paramount in our reporting. Additionally, we’ve excluded variables that can only be determined post-hire from our training data, facilitating its utility for future candidate evaluations by the IBM acquisition team.
While our model serves as a valuable tool, it’s important to acknowledge its limitations. It’s not infallible and should be complemented with other considerations such as resume details and interview performance. Our recommendations are based on upfront data, like marital status or proximity to the office, and should be integrated with holistic hiring practices.
It’s essential to note that our model may not always identify the most qualified candidates, as it prioritizes retention probability over immediate output. This strategic focus might initially impact short-term productivity. However, the long-term benefits of reduced attrition and sustained workforce stability outweigh these potential shortfalls.
We read in the data set and take a quick look at it.
employee <- read.csv("attrition.csv")
str(employee)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
summary(employee)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 Length:1470 Length:1470 Min. : 102.0
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0
## Median :36.00 Mode :character Mode :character Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
## Department DistanceFromHome Education EducationField
## Length:1470 Min. : 1.000 Min. :1.000 Length:1470
## Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character
## Mode :character Median : 7.000 Median :3.000 Mode :character
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
## EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender
## Min. :1 Min. : 1.0 Min. :1.000 Length:1470
## 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 Class :character
## Median :1 Median :1020.5 Median :3.000 Mode :character
## Mean :1 Mean :1024.9 Mean :2.722
## 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
## Max. :1 Max. :2068.0 Max. :4.000
## HourlyRate JobInvolvement JobLevel JobRole
## Min. : 30.00 Min. :1.00 Min. :1.000 Length:1470
## 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Median : 66.00 Median :3.00 Median :2.000 Mode :character
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## Min. :1.000 Length:1470 Min. : 1009 Min. : 2094
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.: 8047
## Median :3.000 Mode :character Median : 4919 Median :14236
## Mean :2.729 Mean : 6503 Mean :14313
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
## Max. :4.000 Max. :19999 Max. :26999
## NumCompaniesWorked Over18 OverTime PercentSalaryHike
## Min. :0.000 Length:1470 Length:1470 Min. :11.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00
## Median :2.000 Mode :character Mode :character Median :14.00
## Mean :2.693 Mean :15.21
## 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :9.000 Max. :25.00
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
We make a simple bar plot of the attrition variable.
library(ggplot2)
ggplot(employee, aes(x = factor(Attrition), fill = factor(Attrition))) +
  geom_bar(color = "black") +
  labs(title = "Distribution of Attrition",
       x = "",
       y = "Count") +
  scale_x_discrete(labels = c("No Attrition", "Attrition")) +
  # Set fill colors manually; guide = "none" suppresses the legend
  scale_fill_manual(values = c("darkgrey", "white"), guide = "none") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "black"))
We drop the uninformative Over18 variable (it equals "Y" for every observation) and convert the categorical character variables to factors. We then take another look at our updated data.
employee$Attrition <- as.factor(employee$Attrition)
employee$BusinessTravel <- as.factor(employee$BusinessTravel)
employee$Department <- as.factor(employee$Department)
employee$EducationField <- as.factor(employee$EducationField)
employee$Gender <- as.factor(employee$Gender)
employee$JobRole <- as.factor(employee$JobRole)
employee$MaritalStatus <- as.factor(employee$MaritalStatus)
employee$Over18 <- NULL
employee$OverTime <- as.factor(employee$OverTime)
str(employee)
## 'data.frame': 1470 obs. of 34 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
summary(employee)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Human Resources : 27 Min. :1 Min. : 1.0 Min. :1.000
## Life Sciences :606 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000
## Marketing :159 Median :1 Median :1020.5 Median :3.000
## Medical :464 Mean :1 Mean :1024.9 Mean :2.722
## Other : 82 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
## Technical Degree:132 Max. :1 Max. :2068.0 Max. :4.000
##
## Gender HourlyRate JobInvolvement JobLevel
## Female:588 Min. : 30.00 Min. :1.00 Min. :1.000
## Male :882 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000
## Median : 66.00 Median :3.00 Median :2.000
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
##
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Sales Executive :326 Min. :1.000 Divorced:327 Min. : 1009
## Research Scientist :292 1st Qu.:2.000 Married :673 1st Qu.: 2911
## Laboratory Technician :259 Median :3.000 Single :470 Median : 4919
## Manufacturing Director :145 Mean :2.729 Mean : 6503
## Healthcare Representative:131 3rd Qu.:4.000 3rd Qu.: 8379
## Manager :102 Max. :4.000 Max. :19999
## (Other) :215
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike
## Min. : 2094 Min. :0.000 No :1054 Min. :11.00
## 1st Qu.: 8047 1st Qu.:1.000 Yes: 416 1st Qu.:12.00
## Median :14236 Median :2.000 Median :14.00
## Mean :14313 Mean :2.693 Mean :15.21
## 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :26999 Max. :9.000 Max. :25.00
##
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
##
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
##
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
##
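The summary above shows 1233 employees who stayed against 237 who left. As a quick sanity check on the class imbalance our model will face, the attrition rate can be computed directly from those counts (a small self-contained sketch; with the data loaded, `mean(employee$Attrition == "Yes")` gives the same number):

```r
# Class balance taken from the summary() output above: 1233 "No", 237 "Yes".
counts <- c(No = 1233, Yes = 237)
attrition_rate <- unname(counts["Yes"] / sum(counts))
round(attrition_rate, 3)  # about 0.161, i.e. roughly 16% attrition
```

This imbalance matters later: a naive classifier that always predicts "No" would already be about 84% accurate, so accuracy alone is a weak yardstick.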
We run a for loop to look at a histogram of all of our numerical variables. This gives us a better idea of the makeup of the data.
library(ggplot2)
numerical_vars <- names(employee)[sapply(employee, is.numeric)]
for (var in numerical_vars) {
  plot_title <- paste("Histogram of", var)
  print(
    # .data[[var]] is the tidy-evaluation replacement for the
    # deprecated aes_string(); bins is set explicitly to silence
    # the default-binwidth message.
    ggplot(employee, aes(x = .data[[var]])) +
      geom_histogram(bins = 30, fill = "lightblue", color = "black") +
      labs(title = plot_title, x = var, y = "Frequency") +
      theme_minimal()
  )
}
We notice that EmployeeCount and StandardHours are constant across every observation, and EmployeeNumber is merely a row identifier, so all three carry no predictive information and we delete them. We also notice that performance rating takes only two values (3 and 4), effectively a binary variable, so we convert it to a factor.
employee$EmployeeCount <- NULL
employee$StandardHours <- NULL
employee$EmployeeNumber <- NULL
employee$PerformanceRating <- as.factor(employee$PerformanceRating)
str(employee)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
We now take a look at a correlation matrix of all of our remaining numerical variables.
numerical_vars <- employee[, sapply(employee, is.numeric)]
correlation_matrix <- cor(numerical_vars)
print(correlation_matrix)
## Age DailyRate DistanceFromHome
## Age 1.000000000 0.010660943 -0.001686120
## DailyRate 0.010660943 1.000000000 -0.004985337
## DistanceFromHome -0.001686120 -0.004985337 1.000000000
## Education 0.208033731 -0.016806433 0.021041826
## EnvironmentSatisfaction 0.010146428 0.018354854 -0.016075327
## HourlyRate 0.024286543 0.023381422 0.031130586
## JobInvolvement 0.029819959 0.046134874 0.008783280
## JobLevel 0.509604228 0.002966335 0.005302731
## JobSatisfaction -0.004891877 0.030571008 -0.003668839
## MonthlyIncome 0.497854567 0.007707059 -0.017014445
## MonthlyRate 0.028051167 -0.032181602 0.027472864
## NumCompaniesWorked 0.299634758 0.038153434 -0.029250804
## PercentSalaryHike 0.003633585 0.022703677 0.040235377
## RelationshipSatisfaction 0.053534720 0.007846031 0.006557475
## StockOptionLevel 0.037509712 0.042142796 0.044871999
## TotalWorkingYears 0.680380536 0.014514739 0.004628426
## TrainingTimesLastYear -0.019620819 0.002452543 -0.036942234
## WorkLifeBalance -0.021490028 -0.037848051 -0.026556004
## YearsAtCompany 0.311308770 -0.034054768 0.009507720
## YearsInCurrentRole 0.212901056 0.009932015 0.018844999
## YearsSinceLastPromotion 0.216513368 -0.033228985 0.010028836
## YearsWithCurrManager 0.202088602 -0.026363178 0.014406048
## Education EnvironmentSatisfaction HourlyRate
## Age 0.208033731 0.010146428 0.024286543
## DailyRate -0.016806433 0.018354854 0.023381422
## DistanceFromHome 0.021041826 -0.016075327 0.031130586
## Education 1.000000000 -0.027128313 0.016774829
## EnvironmentSatisfaction -0.027128313 1.000000000 -0.049856956
## HourlyRate 0.016774829 -0.049856956 1.000000000
## JobInvolvement 0.042437634 -0.008277598 0.042860641
## JobLevel 0.101588886 0.001211699 -0.027853486
## JobSatisfaction -0.011296117 -0.006784353 -0.071334624
## MonthlyIncome 0.094960677 -0.006259088 -0.015794304
## MonthlyRate -0.026084197 0.037599623 -0.015296750
## NumCompaniesWorked 0.126316560 0.012594323 0.022156883
## PercentSalaryHike -0.011110941 -0.031701195 -0.009061986
## RelationshipSatisfaction -0.009118377 0.007665384 0.001330453
## StockOptionLevel 0.018422220 0.003432158 0.050263399
## TotalWorkingYears 0.148279697 -0.002693070 -0.002333682
## TrainingTimesLastYear -0.025100241 -0.019359308 -0.008547685
## WorkLifeBalance 0.009819189 0.027627295 -0.004607234
## YearsAtCompany 0.069113696 0.001457549 -0.019581616
## YearsInCurrentRole 0.060235554 0.018007460 -0.024106220
## YearsSinceLastPromotion 0.054254334 0.016193606 -0.026715586
## YearsWithCurrManager 0.069065378 -0.004998723 -0.020123200
## JobInvolvement JobLevel JobSatisfaction
## Age 0.029819959 0.509604228 -0.0048918771
## DailyRate 0.046134874 0.002966335 0.0305710078
## DistanceFromHome 0.008783280 0.005302731 -0.0036688392
## Education 0.042437634 0.101588886 -0.0112961167
## EnvironmentSatisfaction -0.008277598 0.001211699 -0.0067843526
## HourlyRate 0.042860641 -0.027853486 -0.0713346244
## JobInvolvement 1.000000000 -0.012629883 -0.0214759103
## JobLevel -0.012629883 1.000000000 -0.0019437080
## JobSatisfaction -0.021475910 -0.001943708 1.0000000000
## MonthlyIncome -0.015271491 0.950299913 -0.0071567424
## MonthlyRate -0.016322079 0.039562951 0.0006439169
## NumCompaniesWorked 0.015012413 0.142501124 -0.0556994260
## PercentSalaryHike -0.017204572 -0.034730492 0.0200020394
## RelationshipSatisfaction 0.034296821 0.021641511 -0.0124535932
## StockOptionLevel 0.021522640 0.013983911 0.0106902261
## TotalWorkingYears -0.005533182 0.782207805 -0.0201850727
## TrainingTimesLastYear -0.015337826 -0.018190550 -0.0057793350
## WorkLifeBalance -0.014616593 0.037817746 -0.0194587102
## YearsAtCompany -0.021355427 0.534738687 -0.0038026279
## YearsInCurrentRole 0.008716963 0.389446733 -0.0023047852
## YearsSinceLastPromotion -0.024184292 0.353885347 -0.0182135678
## YearsWithCurrManager 0.025975808 0.375280608 -0.0276562139
## MonthlyIncome MonthlyRate NumCompaniesWorked
## Age 0.497854567 0.0280511671 0.299634758
## DailyRate 0.007707059 -0.0321816015 0.038153434
## DistanceFromHome -0.017014445 0.0274728635 -0.029250804
## Education 0.094960677 -0.0260841972 0.126316560
## EnvironmentSatisfaction -0.006259088 0.0375996229 0.012594323
## HourlyRate -0.015794304 -0.0152967496 0.022156883
## JobInvolvement -0.015271491 -0.0163220791 0.015012413
## JobLevel 0.950299913 0.0395629510 0.142501124
## JobSatisfaction -0.007156742 0.0006439169 -0.055699426
## MonthlyIncome 1.000000000 0.0348136261 0.149515216
## MonthlyRate 0.034813626 1.0000000000 0.017521353
## NumCompaniesWorked 0.149515216 0.0175213534 1.000000000
## PercentSalaryHike -0.027268586 -0.0064293459 -0.010238309
## RelationshipSatisfaction 0.025873436 -0.0040853293 0.052733049
## StockOptionLevel 0.005407677 -0.0343228302 0.030075475
## TotalWorkingYears 0.772893246 0.0264424712 0.237638590
## TrainingTimesLastYear -0.021736277 0.0014668806 -0.066054072
## WorkLifeBalance 0.030683082 0.0079631575 -0.008365685
## YearsAtCompany 0.514284826 -0.0236551067 -0.118421340
## YearsInCurrentRole 0.363817667 -0.0128148744 -0.090753934
## YearsSinceLastPromotion 0.344977638 0.0015667995 -0.036813892
## YearsWithCurrManager 0.344078883 -0.0367459053 -0.110319155
## PercentSalaryHike RelationshipSatisfaction
## Age 0.003633585 0.0535347197
## DailyRate 0.022703677 0.0078460310
## DistanceFromHome 0.040235377 0.0065574746
## Education -0.011110941 -0.0091183767
## EnvironmentSatisfaction -0.031701195 0.0076653835
## HourlyRate -0.009061986 0.0013304528
## JobInvolvement -0.017204572 0.0342968206
## JobLevel -0.034730492 0.0216415105
## JobSatisfaction 0.020002039 -0.0124535932
## MonthlyIncome -0.027268586 0.0258734361
## MonthlyRate -0.006429346 -0.0040853293
## NumCompaniesWorked -0.010238309 0.0527330486
## PercentSalaryHike 1.000000000 -0.0404900811
## RelationshipSatisfaction -0.040490081 1.0000000000
## StockOptionLevel 0.007527748 -0.0459524907
## TotalWorkingYears -0.020608488 0.0240542918
## TrainingTimesLastYear -0.005221012 0.0024965264
## WorkLifeBalance -0.003279636 0.0196044057
## YearsAtCompany -0.035991262 0.0193667869
## YearsInCurrentRole -0.001520027 -0.0151229149
## YearsSinceLastPromotion -0.022154313 0.0334925021
## YearsWithCurrManager -0.011985248 -0.0008674968
## StockOptionLevel TotalWorkingYears
## Age 0.037509712 0.680380536
## DailyRate 0.042142796 0.014514739
## DistanceFromHome 0.044871999 0.004628426
## Education 0.018422220 0.148279697
## EnvironmentSatisfaction 0.003432158 -0.002693070
## HourlyRate 0.050263399 -0.002333682
## JobInvolvement 0.021522640 -0.005533182
## JobLevel 0.013983911 0.782207805
## JobSatisfaction 0.010690226 -0.020185073
## MonthlyIncome 0.005407677 0.772893246
## MonthlyRate -0.034322830 0.026442471
## NumCompaniesWorked 0.030075475 0.237638590
## PercentSalaryHike 0.007527748 -0.020608488
## RelationshipSatisfaction -0.045952491 0.024054292
## StockOptionLevel 1.000000000 0.010135969
## TotalWorkingYears 0.010135969 1.000000000
## TrainingTimesLastYear 0.011274070 -0.035661571
## WorkLifeBalance 0.004128730 0.001007646
## YearsAtCompany 0.015058008 0.628133155
## YearsInCurrentRole 0.050817873 0.460364638
## YearsSinceLastPromotion 0.014352185 0.404857759
## YearsWithCurrManager 0.024698227 0.459188397
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Age -0.019620819 -0.021490028 0.311308770
## DailyRate 0.002452543 -0.037848051 -0.034054768
## DistanceFromHome -0.036942234 -0.026556004 0.009507720
## Education -0.025100241 0.009819189 0.069113696
## EnvironmentSatisfaction -0.019359308 0.027627295 0.001457549
## HourlyRate -0.008547685 -0.004607234 -0.019581616
## JobInvolvement -0.015337826 -0.014616593 -0.021355427
## JobLevel -0.018190550 0.037817746 0.534738687
## JobSatisfaction -0.005779335 -0.019458710 -0.003802628
## MonthlyIncome -0.021736277 0.030683082 0.514284826
## MonthlyRate 0.001466881 0.007963158 -0.023655107
## NumCompaniesWorked -0.066054072 -0.008365685 -0.118421340
## PercentSalaryHike -0.005221012 -0.003279636 -0.035991262
## RelationshipSatisfaction 0.002496526 0.019604406 0.019366787
## StockOptionLevel 0.011274070 0.004128730 0.015058008
## TotalWorkingYears -0.035661571 0.001007646 0.628133155
## TrainingTimesLastYear 1.000000000 0.028072207 0.003568666
## WorkLifeBalance 0.028072207 1.000000000 0.012089185
## YearsAtCompany 0.003568666 0.012089185 1.000000000
## YearsInCurrentRole -0.005737504 0.049856498 0.758753737
## YearsSinceLastPromotion -0.002066536 0.008941249 0.618408865
## YearsWithCurrManager -0.004095526 0.002759440 0.769212425
## YearsInCurrentRole YearsSinceLastPromotion
## Age 0.212901056 0.216513368
## DailyRate 0.009932015 -0.033228985
## DistanceFromHome 0.018844999 0.010028836
## Education 0.060235554 0.054254334
## EnvironmentSatisfaction 0.018007460 0.016193606
## HourlyRate -0.024106220 -0.026715586
## JobInvolvement 0.008716963 -0.024184292
## JobLevel 0.389446733 0.353885347
## JobSatisfaction -0.002304785 -0.018213568
## MonthlyIncome 0.363817667 0.344977638
## MonthlyRate -0.012814874 0.001566800
## NumCompaniesWorked -0.090753934 -0.036813892
## PercentSalaryHike -0.001520027 -0.022154313
## RelationshipSatisfaction -0.015122915 0.033492502
## StockOptionLevel 0.050817873 0.014352185
## TotalWorkingYears 0.460364638 0.404857759
## TrainingTimesLastYear -0.005737504 -0.002066536
## WorkLifeBalance 0.049856498 0.008941249
## YearsAtCompany 0.758753737 0.618408865
## YearsInCurrentRole 1.000000000 0.548056248
## YearsSinceLastPromotion 0.548056248 1.000000000
## YearsWithCurrManager 0.714364762 0.510223636
## YearsWithCurrManager
## Age 0.2020886024
## DailyRate -0.0263631782
## DistanceFromHome 0.0144060484
## Education 0.0690653783
## EnvironmentSatisfaction -0.0049987226
## HourlyRate -0.0201232002
## JobInvolvement 0.0259758079
## JobLevel 0.3752806078
## JobSatisfaction -0.0276562139
## MonthlyIncome 0.3440788833
## MonthlyRate -0.0367459053
## NumCompaniesWorked -0.1103191554
## PercentSalaryHike -0.0119852485
## RelationshipSatisfaction -0.0008674968
## StockOptionLevel 0.0246982266
## TotalWorkingYears 0.4591883971
## TrainingTimesLastYear -0.0040955260
## WorkLifeBalance 0.0027594402
## YearsAtCompany 0.7692124251
## YearsInCurrentRole 0.7143647616
## YearsSinceLastPromotion 0.5102236358
## YearsWithCurrManager 1.0000000000
We see that the only pair of numerical variables that is heavily correlated is monthly income and job level, with a correlation of about \(0.95\). This makes sense, since pay scales with seniority, and the correlation still falls short of a perfect \(1\). Additionally, both variables seem very relevant to our question, so we choose to keep them both.
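Scanning the full printed matrix by eye is error-prone, so a small helper can list every pair above a chosen threshold. Below is a sketch demonstrated on a toy matrix so it runs standalone; with the real data, pass `correlation_matrix` from above (the `high_cor_pairs` name is ours, not from the original analysis):

```r
# List variable pairs whose absolute correlation exceeds a threshold.
high_cor_pairs <- function(cm, threshold = 0.9) {
  cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair only once
  idx <- which(abs(cm) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             cor  = cm[idx])
}

# Toy example: x and y strongly related, z independent of both.
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.1)
z <- rnorm(100)
high_cor_pairs(cor(cbind(x, y, z)))  # one row: the (x, y) pair
```

Applied to `correlation_matrix`, this flags only the MonthlyIncome/JobLevel pair, confirming the visual scan.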
We decide that we want to predict on our data before any hire is made. This means we must delete variables that can only be determined after a hire, such as job satisfaction and years at company.
Before we do this, however, we
create a variable to predict. We call it employee quality. "Quality"
represents the amount of money an employee makes our company
daily. This is why we subtract monthly income divided by 30, an
approximation of daily salary cost. Any variable that is a positive
indicator of job performance or output increases
employee quality. Research guided the weight we allocate to
each variable; for instance, increased job involvement from employees
has been shown to lead to substantial increases in company profit.
Admittedly, this metric is somewhat arbitrary. It is certainly
not a perfect representation of profit per employee, which is likely
impossible to capture in a single number: many
factors cannot be measured, and some things, such as interaction between
co-workers, cannot be reduced to one value. Nevertheless, it
provides us a benchmark that carries real significance.
After
this is done, we look at a histogram of our new quality variable. Notice
that it is roughly normally distributed, which supports its validity as
a measurement of quality. Also notice that the mean quality is
\(-16\) dollars, suggesting that we are
likely underestimating profit per worker.
# Quality: estimated dollars added per day, built from performance
# indicators minus an approximate daily salary cost (MonthlyIncome / 30)
employee$Quality <- with(employee,
  (100 * YearsAtCompany / Age) +
  (40 * JobInvolvement) +
  (20 * JobLevel) +
  ifelse(OverTime == "Yes", 30, 0) +
  ifelse(PerformanceRating == "4", 150, 0) -
  (MonthlyIncome / 30))
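As a concrete illustration of this formula, consider a hypothetical employee (these numbers are invented for illustration, not a row from the data):

```r
# Hypothetical: 6 years at company, age 30, job involvement 3, job level 2,
# works overtime, performance rating 3 (no bonus), income $5000/month
(100 * 6 / 30) + (40 * 3) + (20 * 2) + 30 + 0 - (5000 / 30)
# 20 + 120 + 40 + 30 - 166.67, about 43.33 dollars added per day
```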
employee$YearsAtCompany <- NULL
employee$YearsInCurrentRole <- NULL
employee$JobInvolvement <- NULL
employee$JobLevel <- NULL
employee$OverTime <- NULL
employee$PerformanceRating <- NULL
employee$MonthlyIncome <- NULL
employee$DailyRate <- NULL
employee$MonthlyRate <- NULL
employee$HourlyRate <- NULL
employee$StockOptionLevel <- NULL
employee$YearsSinceLastPromotion <- NULL
employee$YearsWithCurrManager <- NULL
employee$PercentSalaryHike <- NULL
employee$TrainingTimesLastYear <- NULL
employee$JobSatisfaction <- NULL
str(employee)
## 'data.frame': 1470 obs. of 16 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ Quality : num 4.87 119.41 60.33 97.28 31.81 ...
summary(employee)
## Age Attrition BusinessTravel
## Min. :18.00 No :1233 Non-Travel : 150
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277
## Median :36.00 Travel_Rarely :1043
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EnvironmentSatisfaction Gender
## Human Resources : 27 Min. :1.000 Female:588
## Life Sciences :606 1st Qu.:2.000 Male :882
## Marketing :159 Median :3.000
## Medical :464 Mean :2.722
## Other : 82 3rd Qu.:4.000
## Technical Degree:132 Max. :4.000
##
## JobRole MaritalStatus NumCompaniesWorked
## Sales Executive :326 Divorced:327 Min. :0.000
## Research Scientist :292 Married :673 1st Qu.:1.000
## Laboratory Technician :259 Single :470 Median :2.000
## Manufacturing Director :145 Mean :2.693
## Healthcare Representative:131 3rd Qu.:4.000
## Manager :102 Max. :9.000
## (Other) :215
## RelationshipSatisfaction TotalWorkingYears WorkLifeBalance Quality
## Min. :1.000 Min. : 0.00 Min. :1.000 Min. :-499.40
## 1st Qu.:2.000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.: -79.66
## Median :3.000 Median :10.00 Median :3.000 Median : 16.32
## Mean :2.712 Mean :11.28 Mean :2.761 Mean : -16.04
## 3rd Qu.:4.000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.: 70.68
## Max. :4.000 Max. :40.00 Max. :4.000 Max. : 306.63
##
hist(employee$Quality,
     main = "Histogram of Employee Quality",
     col = "black",
     border = "white",
     xlab = "$ Added per Day",
     breaks = seq(-500, 350, by = 50))  # bins of width 50
mean(employee$Quality)
## [1] -16.03619
We have to normalize our data in order to run some of our models. This puts all of our variables on the same scale. Otherwise, some variables would have too much weight in determining our predictions.
employeedummy <- as.data.frame(model.matrix(~. -1, data=employee))
normalize <- function(x){
(x - min(x))/(max(x) - min(x))
}
employee_n <- as.data.frame(lapply(employeedummy, normalize))
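As a quick usage example, this min-max scaling maps any numeric vector onto \([0, 1]\):

```r
# Min-max scaling: smallest value maps to 0, largest to 1
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(2, 4, 6))         # 0.0 0.5 1.0
normalize(c(10, 30, 20, 40))  # 0.0000000 0.6666667 0.3333333 1.0000000
```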
We make Quality a binary variable for prediction. We decide that if a
worker produces over \(50\)
dollars of output a day, they are a quality worker. Once again, this is an
arbitrary cutoff, but it gives us something to work with. It also
ensures that even if our quality metric is prone to error, those who are
rated as "high quality" are still very likely to produce a profit.
We then look at a graph showing the main problem with IBM's hires: the
quality workers are those most likely to quit. Although quality
workers make up only about a third of the workforce, roughly half of those who
quit come from that group. This is the fundamental issue with
our current hires. We are hiring for quality, but many of these hires are
quitting. We need a more holistic hiring process in which we also search for
loyal workers.
employee_n$Quality <- employee$Quality
employee_n$Quality <- ifelse(employee_n$Quality > 50, 1, 0)
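The class balance of the new binary target can be checked directly; consistent with the summary above, roughly \(35\)% of workers clear the \(50\)-dollar bar:

```r
table(employee_n$Quality)  # counts of non-quality (0) vs quality (1) workers
mean(employee_n$Quality)   # share of quality workers, about 0.35
```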
library(ggplot2)
ggplot(employee_n, aes(x = factor(AttritionYes), fill = factor(Quality))) +
geom_bar(position = "stack", color = "black") +
labs(title = "Attrition vs Quality",
x = "",
y = "Count") +
scale_x_discrete(labels = c("Not Attrition", "Attrition")) +
scale_fill_manual(values = c("white", "black"), guide = "none") +
facet_wrap(~factor(Quality, labels = c("Not Quality", "Quality"))) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.line = element_line(color = "black"),
strip.background = element_blank(),
strip.text.x = element_text(size = 12, face = "bold"))
We create a second, nearly identical data set for predicting attrition. We are careful not to include quality when predicting attrition, and vice versa. We are now prepared to build models predicting both attrition and quality. We take a quick look at our data sets to confirm.
attrition <- employee_n
attrition$Quality <- NULL
attrition$AttritionNo <- NULL
employee_n$AttritionNo <- NULL
employee_n$AttritionYes <- NULL
str(employee_n)
## 'data.frame': 1470 obs. of 29 variables:
## $ Age : num 0.548 0.738 0.452 0.357 0.214 ...
## $ BusinessTravelTravel_Frequently : num 0 1 0 1 0 1 0 0 1 0 ...
## $ BusinessTravelTravel_Rarely : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DepartmentResearch...Development: num 0 1 1 1 1 1 1 1 1 1 ...
## $ DepartmentSales : num 1 0 0 0 0 0 0 0 0 0 ...
## $ DistanceFromHome : num 0 0.25 0.0357 0.0714 0.0357 ...
## $ Education : num 0.25 0 0.25 0.75 0 0.25 0.5 0 0.5 0.5 ...
## $ EducationFieldLife.Sciences : num 1 1 0 1 0 1 0 1 1 0 ...
## $ EducationFieldMarketing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EducationFieldMedical : num 0 0 0 0 1 0 1 0 0 1 ...
## $ EducationFieldOther : num 0 0 1 0 0 0 0 0 0 0 ...
## $ EducationFieldTechnical.Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EnvironmentSatisfaction : num 0.333 0.667 1 1 0 ...
## $ GenderMale : num 0 1 1 0 1 1 0 1 1 1 ...
## $ JobRoleHuman.Resources : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleLaboratory.Technician : num 0 0 1 0 1 1 1 1 0 0 ...
## $ JobRoleManager : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleManufacturing.Director : num 0 0 0 0 0 0 0 0 1 0 ...
## $ JobRoleResearch.Director : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleResearch.Scientist : num 0 1 0 1 0 0 0 0 0 0 ...
## $ JobRoleSales.Executive : num 1 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleSales.Representative : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MaritalStatusMarried : num 0 1 0 1 1 0 1 0 0 1 ...
## $ MaritalStatusSingle : num 1 0 1 0 0 1 0 0 1 0 ...
## $ NumCompaniesWorked : num 0.889 0.111 0.667 0.111 1 ...
## $ RelationshipSatisfaction : num 0 1 0.333 0.667 1 ...
## $ TotalWorkingYears : num 0.2 0.25 0.175 0.2 0.15 0.2 0.3 0.025 0.25 0.425 ...
## $ WorkLifeBalance : num 0 0.667 0.667 0.667 0.667 ...
## $ Quality : num 0 1 1 1 0 1 1 1 0 0 ...
summary(employee_n)
## Age BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2857 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.4286 Median :0.0000 Median :1.0000
## Mean :0.4506 Mean :0.1884 Mean :0.7095
## 3rd Qu.:0.5952 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## DepartmentResearch...Development DepartmentSales DistanceFromHome
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.03571
## Median :1.0000 Median :0.0000 Median :0.21429
## Mean :0.6537 Mean :0.3034 Mean :0.29259
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.46429
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## Education EducationFieldLife.Sciences EducationFieldMarketing
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2500 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.5000 Median :0.0000 Median :0.0000
## Mean :0.4782 Mean :0.4122 Mean :0.1082
## 3rd Qu.:0.7500 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## EducationFieldMedical EducationFieldOther EducationFieldTechnical.Degree
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3156 Mean :0.05578 Mean :0.0898
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## EnvironmentSatisfaction GenderMale JobRoleHuman.Resources
## Min. :0.0000 Min. :0.0 Min. :0.00000
## 1st Qu.:0.3333 1st Qu.:0.0 1st Qu.:0.00000
## Median :0.6667 Median :1.0 Median :0.00000
## Mean :0.5739 Mean :0.6 Mean :0.03537
## 3rd Qu.:1.0000 3rd Qu.:1.0 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0 Max. :1.00000
## JobRoleLaboratory.Technician JobRoleManager JobRoleManufacturing.Director
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1762 Mean :0.06939 Mean :0.09864
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## JobRoleResearch.Director JobRoleResearch.Scientist JobRoleSales.Executive
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.05442 Mean :0.1986 Mean :0.2218
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## JobRoleSales.Representative MaritalStatusMarried MaritalStatusSingle
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.05646 Mean :0.4578 Mean :0.3197
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## NumCompaniesWorked RelationshipSatisfaction TotalWorkingYears WorkLifeBalance
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.1111 1st Qu.:0.3333 1st Qu.:0.150 1st Qu.:0.3333
## Median :0.2222 Median :0.6667 Median :0.250 Median :0.6667
## Mean :0.2992 Mean :0.5707 Mean :0.282 Mean :0.5871
## 3rd Qu.:0.4444 3rd Qu.:1.0000 3rd Qu.:0.375 3rd Qu.:0.6667
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
## Quality
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3483
## 3rd Qu.:1.0000
## Max. :1.0000
summary(attrition)
## Age AttritionYes BusinessTravelTravel_Frequently
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2857 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.4286 Median :0.0000 Median :0.0000
## Mean :0.4506 Mean :0.1612 Mean :0.1884
## 3rd Qu.:0.5952 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## BusinessTravelTravel_Rarely DepartmentResearch...Development DepartmentSales
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7095 Mean :0.6537 Mean :0.3034
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## DistanceFromHome Education EducationFieldLife.Sciences
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.03571 1st Qu.:0.2500 1st Qu.:0.0000
## Median :0.21429 Median :0.5000 Median :0.0000
## Mean :0.29259 Mean :0.4782 Mean :0.4122
## 3rd Qu.:0.46429 3rd Qu.:0.7500 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## EducationFieldMarketing EducationFieldMedical EducationFieldOther
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1082 Mean :0.3156 Mean :0.05578
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## EducationFieldTechnical.Degree EnvironmentSatisfaction GenderMale
## Min. :0.0000 Min. :0.0000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.3333 1st Qu.:0.0
## Median :0.0000 Median :0.6667 Median :1.0
## Mean :0.0898 Mean :0.5739 Mean :0.6
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0
## Max. :1.0000 Max. :1.0000 Max. :1.0
## JobRoleHuman.Resources JobRoleLaboratory.Technician JobRoleManager
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.03537 Mean :0.1762 Mean :0.06939
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## JobRoleManufacturing.Director JobRoleResearch.Director
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.09864 Mean :0.05442
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## JobRoleResearch.Scientist JobRoleSales.Executive JobRoleSales.Representative
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1986 Mean :0.2218 Mean :0.05646
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## MaritalStatusMarried MaritalStatusSingle NumCompaniesWorked
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.1111
## Median :0.0000 Median :0.0000 Median :0.2222
## Mean :0.4578 Mean :0.3197 Mean :0.2992
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.4444
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## RelationshipSatisfaction TotalWorkingYears WorkLifeBalance
## Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.3333 1st Qu.:0.150 1st Qu.:0.3333
## Median :0.6667 Median :0.250 Median :0.6667
## Mean :0.5707 Mean :0.282 Mean :0.5871
## 3rd Qu.:1.0000 3rd Qu.:0.375 3rd Qu.:0.6667
## Max. :1.0000 Max. :1.000 Max. :1.0000
We create a test/train split for our data. We build our models with the training data and evaluate them with the test data. We choose a ratio of 0.5 to ensure that we have ample data for both training and testing. We split our attrition and quality data sets using the same rows. Additionally, we keep the rows of the original (unnormalized) data corresponding to the test split for later evaluation.
ratio <- 0.5
set.seed(122121)
trainRows <- sample(1:nrow(employee_n), ratio*nrow(employee_n))
employeeTrain <- employee_n[trainRows, ]
employeeTest <- employee_n[-trainRows, ]
employeeTestLabel <- employeeTest$Quality
employeeTrainLabel <- employeeTrain$Quality
employeeTestPredictors <- employeeTest[,-29]
employeeTrainPredictors <- employeeTrain[,-29]
attritionTrain <- attrition[trainRows, ]
attritionTest <- attrition[-trainRows, ]
attritionTestLabel <- attritionTest$AttritionYes
attritionTrainLabel <- attritionTrain$AttritionYes
attritionTestPredictors <- attritionTest[,-2]
attritionTrainPredictors <- attritionTrain[,-2]
quality <- employee[-trainRows,]
We use our training data to build a GLM model for quality. Once again, this only includes variables that can be acquired prior to a hire. We then check how our model predicts on the test data. We have a Kappa of \(0.40\).
library(caret)
## Loading required package: lattice
GlmModel <- glm(Quality~., data=employeeTrain, family="binomial")
summary(GlmModel)
##
## Call:
## glm(formula = Quality ~ ., family = "binomial", data = employeeTrain)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.475e+00 2.885e+03 -0.001 0.9996
## Age 2.875e-01 5.649e-01 0.509 0.6108
## BusinessTravelTravel_Frequently 3.785e-01 3.655e-01 1.036 0.3003
## BusinessTravelTravel_Rarely 1.900e-01 3.162e-01 0.601 0.5480
## DepartmentResearch...Development -8.904e-01 2.885e+03 0.000 0.9998
## DepartmentSales -4.120e-01 2.971e+03 0.000 0.9999
## DistanceFromHome 2.008e-02 3.287e-01 0.061 0.9513
## Education 9.411e-03 3.906e-01 0.024 0.9808
## EducationFieldLife.Sciences 9.548e-01 8.015e-01 1.191 0.2335
## EducationFieldMarketing 6.621e-01 8.869e-01 0.747 0.4553
## EducationFieldMedical 8.171e-01 8.021e-01 1.019 0.3083
## EducationFieldOther 8.951e-01 8.689e-01 1.030 0.3030
## EducationFieldTechnical.Degree 7.070e-01 8.304e-01 0.851 0.3945
## EnvironmentSatisfaction 1.171e-02 2.516e-01 0.047 0.9629
## GenderMale 1.187e-01 1.985e-01 0.598 0.5500
## JobRoleHuman.Resources 1.582e+00 2.885e+03 0.001 0.9996
## JobRoleLaboratory.Technician 2.384e+00 4.575e-01 5.210 1.88e-07 ***
## JobRoleManager -1.608e+01 1.329e+03 -0.012 0.9903
## JobRoleManufacturing.Director 8.881e-01 5.075e-01 1.750 0.0801 .
## JobRoleResearch.Director -1.608e+01 1.068e+03 -0.015 0.9880
## JobRoleResearch.Scientist 2.661e+00 4.549e-01 5.850 4.91e-09 ***
## JobRoleSales.Executive 8.691e-02 2.010e+03 0.000 1.0000
## JobRoleSales.Representative 2.622e+00 2.010e+03 0.001 0.9990
## MaritalStatusMarried -6.244e-01 2.434e-01 -2.566 0.0103 *
## MaritalStatusSingle -7.098e-01 2.578e-01 -2.753 0.0059 **
## NumCompaniesWorked -7.341e-01 3.638e-01 -2.018 0.0436 *
## RelationshipSatisfaction 3.091e-01 2.649e-01 1.167 0.2433
## TotalWorkingYears -1.686e+00 8.829e-01 -1.910 0.0561 .
## WorkLifeBalance -1.921e-01 3.866e-01 -0.497 0.6193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 965.47 on 734 degrees of freedom
## Residual deviance: 697.65 on 706 degrees of freedom
## AIC: 755.65
##
## Number of Fisher Scoring iterations: 17
glmPred <- predict(GlmModel, newdata=employeeTest, type = "response")
glmBin <- ifelse(glmPred >= 0.5, 1, 0)
confusionMatrix(as.factor(glmBin), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 380 87
## 1 112 156
##
## Accuracy : 0.7293
## 95% CI : (0.6956, 0.7611)
## No Information Rate : 0.6694
## P-Value [Acc > NIR] : 0.0002658
##
## Kappa : 0.4038
##
## Mcnemar's Test P-Value : 0.0888839
##
## Sensitivity : 0.6420
## Specificity : 0.7724
## Pos Pred Value : 0.5821
## Neg Pred Value : 0.8137
## Prevalence : 0.3306
## Detection Rate : 0.2122
## Detection Prevalence : 0.3646
## Balanced Accuracy : 0.7072
##
## 'Positive' Class : 1
##
We build our KNN model for employee quality and save our predictions. KNN takes a test data point and finds the training points that are closest to it, then uses a majority vote among those neighbors to assign a class (\(1\) or \(0\)). "\(K\)" is the number of nearest training points used in the vote. We try many values of \(k\) and determine that \(k = 13\) works best on this data. The usual rule of thumb suggests \(k \approx \sqrt{735} \approx 27\), but this gives us a much worse model, likely due to the relatively small number of predictors. We then evaluate how our model performs on the test data. We see that we have a Kappa of \(0.40\).
library(class)
KnnModel <- knn(train = employeeTrainPredictors, test = employeeTestPredictors, cl = employeeTrainLabel, k = 13)
confusionMatrix(as.factor(KnnModel), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 392 97
## 1 100 146
##
## Accuracy : 0.732
## 95% CI : (0.6984, 0.7637)
## No Information Rate : 0.6694
## P-Value [Acc > NIR] : 0.0001433
##
## Kappa : 0.3963
##
## Mcnemar's Test P-Value : 0.8866897
##
## Sensitivity : 0.6008
## Specificity : 0.7967
## Pos Pred Value : 0.5935
## Neg Pred Value : 0.8016
## Prevalence : 0.3306
## Detection Rate : 0.1986
## Detection Prevalence : 0.3347
## Balanced Accuracy : 0.6988
##
## 'Positive' Class : 1
##
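The search over \(k\) described above can be reproduced with a simple loop (a sketch; the candidate values shown are illustrative):

```r
library(class)
library(caret)

# Record the test-set Kappa for several candidate k values
for (k in c(5, 9, 13, 21, 27)) {
  pred <- knn(train = employeeTrainPredictors, test = employeeTestPredictors,
              cl = employeeTrainLabel, k = k)
  kap <- confusionMatrix(as.factor(pred),
                         as.factor(employeeTestLabel))$overall["Kappa"]
  cat("k =", k, " Kappa =", round(kap, 3), "\n")
}
```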
We build our neural network model to predict quality. This model is loosely inspired by the human brain: in each layer, every neuron is connected to every neuron in the previous layer. We use \(5\) hidden layers of \(60\), \(30\), \(10\), \(6\), and \(4\) neurons, for a total of \(110\). We increase our learning-rate factor and threshold to ensure that training finishes in a reasonable time. We then make predictions on our test data. We do not save binary predictions, because we will allow our decision tree to find a good threshold; however, we use a \(0.5\) cutoff here to evaluate the model. We see that we achieve a Kappa of \(0.43\).
library(neuralnet)
set.seed(422)
annmodel <- neuralnet(Quality ~ ., data = employeeTrain, hidden = c(60, 30, 10,6,4), threshold = 5,
stepmax = 1e+05, rep = 1, startweights = NULL,
learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
plus = 1.2), learningrate = NULL, lifesign = "none",
lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
act.fct = "logistic", linear.output = TRUE, exclude = NULL,
constant.weights = NULL, likelihood = FALSE)
library(caret)
annPred <- predict(annmodel, employeeTest)
annBin <- ifelse(annPred >= 0.5, 1, 0)
confusionMatrix(as.factor(annBin), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 380 81
## 1 112 162
##
## Accuracy : 0.7374
## 95% CI : (0.704, 0.7689)
## No Information Rate : 0.6694
## P-Value [Acc > NIR] : 3.853e-05
##
## Kappa : 0.4253
##
## Mcnemar's Test P-Value : 0.03082
##
## Sensitivity : 0.6667
## Specificity : 0.7724
## Pos Pred Value : 0.5912
## Neg Pred Value : 0.8243
## Prevalence : 0.3306
## Detection Rate : 0.2204
## Detection Prevalence : 0.3728
## Balanced Accuracy : 0.7195
##
## 'Positive' Class : 1
##
We build our SVM model and make binary predictions on quality. We try several different kernels and evaluate the Kappa of each model on the test data, saving the model with the best Kappa as the SVM input to our decision tree. After running the loop, we see that the linear kernel (vanilladot) performs best, so we save that model. It has a Kappa of \(0.48\).
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(caret)
kernels <- c("vanilladot", "rbfdot", "polydot", "tanhdot",
"laplacedot", "besseldot", "anovadot", "splinedot")
best_kappa <- -Inf
best_model <- NULL
best_predictions <- NULL
for (kernel in kernels) {
classifier <- ksvm(factor(Quality) ~ ., data = employeeTrain, kernel = kernel)
predictions <- predict(classifier, employeeTest)
predictions <- as.factor(predictions)
cm <- confusionMatrix(as.factor(predictions), as.factor(employeeTest$Quality), positive = "1")
kappa_value <- cm$overall["Kappa"]
if (kappa_value > best_kappa) {
best_kappa <- kappa_value
best_model <- kernel
best_predictions <- predictions
}
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
# Save the predictions of the best model to a dataframe
svmPred <- data.frame(Predictions = as.character(best_predictions))
svmPredictions <- as.factor(svmPred$Predictions)
# Print the best model and its kappa
cat("Best Model:", best_model, "- Best Kappa:", best_kappa, "\n")
## Best Model: vanilladot - Best Kappa: 0.4758871
Now we build our basic decision tree model for quality. The tree makes a sequence of binary splits on different variables, and each leaf assigns \(1\) or \(0\). We will feed these predictions into our larger decision tree model, but first we evaluate them on their own. We see that we have a Kappa of \(0.39\).
library(C50)
dt <- C5.0(as.factor(Quality) ~., data = employeeTrain)
plot(dt)
dtpredict <- predict(dt, employeeTest)
confusionMatrix(as.factor(dtpredict), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 402 104
## 1 90 139
##
## Accuracy : 0.7361
## 95% CI : (0.7026, 0.7676)
## No Information Rate : 0.6694
## P-Value [Acc > NIR] : 5.404e-05
##
## Kappa : 0.3948
##
## Mcnemar's Test P-Value : 0.3506
##
## Sensitivity : 0.5720
## Specificity : 0.8171
## Pos Pred Value : 0.6070
## Neg Pred Value : 0.7945
## Prevalence : 0.3306
## Detection Rate : 0.1891
## Detection Prevalence : 0.3116
## Balanced Accuracy : 0.6945
##
## 'Positive' Class : 1
##
We combine all of our previous models' predictions into a single data frame for our employee quality predictions. We also include the true quality values from the test set so that we are able to train our stacked model. Whenever a model produces a non-binary prediction, we enter that raw value into the data frame, as we want our final decision tree to choose the thresholds for us. We will build our final model later.
employeeModels <- data.frame(dtpredict, annPred, svmPredictions, KnnModel, glmPred, employeeTest$Quality)
str(employeeModels)
## 'data.frame': 735 obs. of 6 variables:
## $ dtpredict : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 1 1 ...
## $ annPred : num 0.519 0.671 0.575 0.356 0.337 ...
## $ svmPredictions : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 1 2 1 ...
## $ KnnModel : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 2 1 2 1 ...
## $ glmPred : num 0.462 0.687 0.663 0.45 0.266 ...
## $ employeeTest.Quality: num 1 1 1 1 0 0 1 0 1 0 ...
We now use our training data to build a GLM model for predicting attrition. We have a Kappa of \(0.23\), an early indication that our attrition models are less powerful than our quality models.
library(caret)
GlmModel <- glm(AttritionYes~., data=attritionTrain, family="binomial")
summary(GlmModel)
##
## Call:
## glm(formula = AttritionYes ~ ., family = "binomial", data = attritionTrain)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.0533 563.8412 -0.027 0.97870
## Age -0.3488 0.6730 -0.518 0.60427
## BusinessTravelTravel_Frequently 1.1585 0.4425 2.618 0.00884 **
## BusinessTravelTravel_Rarely 0.4902 0.3977 1.233 0.21769
## DepartmentResearch...Development 14.5732 563.8414 0.026 0.97938
## DepartmentSales 13.4705 563.8418 0.024 0.98094
## DistanceFromHome 1.5064 0.3543 4.252 2.12e-05 ***
## Education 0.1602 0.4556 0.352 0.72510
## EducationFieldLife.Sciences -2.5083 1.0772 -2.329 0.01988 *
## EducationFieldMarketing -1.8959 1.1242 -1.687 0.09170 .
## EducationFieldMedical -2.6981 1.0751 -2.510 0.01209 *
## EducationFieldOther -2.7292 1.1594 -2.354 0.01858 *
## EducationFieldTechnical.Degree -1.8200 1.0832 -1.680 0.09294 .
## EnvironmentSatisfaction -0.8173 0.2962 -2.759 0.00579 **
## GenderMale 0.2248 0.2338 0.962 0.33623
## JobRoleHuman.Resources 14.5066 563.8411 0.026 0.97947
## JobRoleLaboratory.Technician 0.9750 0.5226 1.866 0.06206 .
## JobRoleManager 0.8864 0.9116 0.972 0.33089
## JobRoleManufacturing.Director -0.1167 0.6672 -0.175 0.86120
## JobRoleResearch.Director -0.7997 1.1388 -0.702 0.48258
## JobRoleResearch.Scientist 0.5470 0.5243 1.043 0.29682
## JobRoleSales.Executive 1.7125 1.4095 1.215 0.22436
## JobRoleSales.Representative 3.1210 1.4600 2.138 0.03254 *
## MaritalStatusMarried 0.6628 0.3381 1.960 0.04995 *
## MaritalStatusSingle 1.6590 0.3388 4.897 9.73e-07 ***
## NumCompaniesWorked 1.4358 0.3981 3.607 0.00031 ***
## RelationshipSatisfaction -0.7563 0.3035 -2.492 0.01272 *
## TotalWorkingYears -2.0314 1.0776 -1.885 0.05942 .
## WorkLifeBalance -0.8707 0.4505 -1.933 0.05328 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 704.04 on 734 degrees of freedom
## Residual deviance: 550.47 on 706 degrees of freedom
## AIC: 608.47
##
## Number of Fisher Scoring iterations: 14
glmPred <- predict(GlmModel, newdata=attritionTest, type = "response")
glmBin <- ifelse(glmPred >= 0.5, 1, 0)
confusionMatrix(as.factor(glmBin), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 605 78
## 1 29 23
##
## Accuracy : 0.8544
## 95% CI : (0.8268, 0.8791)
## No Information Rate : 0.8626
## P-Value [Acc > NIR] : 0.759
##
## Kappa : 0.2286
##
## Mcnemar's Test P-Value : 3.478e-06
##
## Sensitivity : 0.22772
## Specificity : 0.95426
## Pos Pred Value : 0.44231
## Neg Pred Value : 0.88580
## Prevalence : 0.13741
## Detection Rate : 0.03129
## Detection Prevalence : 0.07075
## Balanced Accuracy : 0.59099
##
## 'Positive' Class : 1
##
We build our KNN model for attrition. This time, we find that \(k = 6\) works best. Once again, the model performs worse than it did for quality, with a Kappa of \(0.19\).
library(class)
set.seed(8)
KnnModel <- knn(train = attritionTrainPredictors, test = attritionTestPredictors, cl = attritionTrainLabel, k = 6)
confusionMatrix(as.factor(KnnModel), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 607 82
## 1 27 19
##
## Accuracy : 0.8517
## 95% CI : (0.8239, 0.8766)
## No Information Rate : 0.8626
## P-Value [Acc > NIR] : 0.8194
##
## Kappa : 0.1887
##
## Mcnemar's Test P-Value : 2.313e-07
##
## Sensitivity : 0.18812
## Specificity : 0.95741
## Pos Pred Value : 0.41304
## Neg Pred Value : 0.88099
## Prevalence : 0.13741
## Detection Rate : 0.02585
## Detection Prevalence : 0.06259
## Balanced Accuracy : 0.57277
##
## 'Positive' Class : 1
##
We build our neural network model for predicting attrition, once again using \(5\) hidden layers of \(60\), \(30\), \(10\), \(6\), and \(4\) neurons, for a total of \(110\). We have a final Kappa of \(0.23\), once again worse than the corresponding quality prediction.
library(neuralnet)
set.seed(422)
annmodel <- neuralnet(AttritionYes ~ ., data = attritionTrain, hidden = c(60, 30, 10,6,4), threshold = 2,
stepmax = 1e+05, rep = 1, startweights = NULL,
learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
plus = 1.2), learningrate = NULL, lifesign = "none",
lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
act.fct = "logistic", linear.output = TRUE, exclude = NULL,
constant.weights = NULL, likelihood = FALSE)
library(caret)
annPred <- predict(annmodel, attritionTest)
annBin <- ifelse(annPred >= 0.5, 1, 0)
confusionMatrix(as.factor(annBin), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 591 74
## 1 43 27
##
## Accuracy : 0.8408
## 95% CI : (0.8123, 0.8665)
## No Information Rate : 0.8626
## P-Value [Acc > NIR] : 0.959289
##
## Kappa : 0.2291
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.26733
## Specificity : 0.93218
## Pos Pred Value : 0.38571
## Neg Pred Value : 0.88872
## Prevalence : 0.13741
## Detection Rate : 0.03673
## Detection Prevalence : 0.09524
## Balanced Accuracy : 0.59975
##
## 'Positive' Class : 1
##
We now make an SVM model for attrition. We once again try many different kernels; this time the ANOVA kernel (anovadot) performs best, with a Kappa of \(0.20\), once again much lower than our Kappa for quality.
library(kernlab)
library(caret)
kernels <- c("vanilladot", "rbfdot", "polydot", "tanhdot", "laplacedot", "besseldot", "anovadot", "splinedot")
best_kappa <- -Inf
best_model <- NULL
best_predictions <- NULL
for (kernel in kernels) {
  classifier <- ksvm(factor(AttritionYes) ~ ., data = attritionTrain, kernel = kernel)
  predictions <- as.factor(predict(classifier, attritionTest))
  cm <- confusionMatrix(predictions, as.factor(attritionTest$AttritionYes), positive = "1")
  kappa_value <- cm$overall["Kappa"]
  # Keep the kernel with the best held-out Kappa
  if (kappa_value > best_kappa) {
    best_kappa <- kappa_value
    best_model <- kernel
    best_predictions <- predictions
  }
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
# Save the predictions of the best model to a dataframe
svmPred <- data.frame(Predictions = as.character(best_predictions))
svmPredictions <- as.factor(svmPred$Predictions)
# Print the best model and its kappa
cat("Best Model:", best_model, "- Best Kappa:", best_kappa, "\n")
## Best Model: anovadot - Best Kappa: 0.1987956
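Every kernel above was fit with `ksvm`'s default hyperparameters. A natural refinement, not done in the report, would be to sweep the cost parameter `C` for the winning anovadot kernel. The sketch below uses `iris` as stand-in data, so the names and the candidate grid are assumptions:

```r
library(kernlab)
library(caret)

set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))

# Hypothetical refinement: score several values of C for the anovadot
# kernel by held-out Kappa, leaving everything else at ksvm's defaults.
Cs <- c(0.25, 0.5, 1, 2, 4)
kappas <- sapply(Cs, function(C) {
  fit <- ksvm(Species ~ ., data = iris[idx, ], kernel = "anovadot", C = C)
  pred <- predict(fit, iris[-idx, ])
  confusionMatrix(pred, iris$Species[-idx])$overall["Kappa"]
})
bestC <- Cs[which.max(kappas)]
```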
Now we make our basic decision tree model for attrition. With a Kappa of \(0.20\), it once again has less predictive power than our quality model.
library(C50)
dt <- C5.0(as.factor(AttritionYes) ~., data = attritionTrain)
plot(dt)
dtpredict <- predict(dt, attritionTest)
confusionMatrix(as.factor(dtpredict), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 589 76
## 1 45 25
##
## Accuracy : 0.8354
## 95% CI : (0.8065, 0.8615)
## No Information Rate : 0.8626
## P-Value [Acc > NIR] : 0.984274
##
## Kappa : 0.2027
##
## Mcnemar's Test P-Value : 0.006386
##
## Sensitivity : 0.24752
## Specificity : 0.92902
## Pos Pred Value : 0.35714
## Neg Pred Value : 0.88571
## Prevalence : 0.13741
## Detection Rate : 0.03401
## Detection Prevalence : 0.09524
## Balanced Accuracy : 0.58827
##
## 'Positive' Class : 1
##
We combine all of our previous models' predictions into a single data frame for attrition. We also include our response variable, attrition, so that we can train our stacked model. Where a model produces a non-binary prediction (the ANN and GLM probabilities), we enter that raw value into the data frame, as we want our final decision tree to set the thresholds for us.
attritionModels <- data.frame(dtpredict, annPred, svmPredictions, KnnModel, glmPred, attritionTest$AttritionYes)
str(attritionModels)
## 'data.frame': 735 obs. of 6 variables:
## $ dtpredict : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ annPred : num 0.0499 -0.0227 0.4672 0.0863 0.1527 ...
## $ svmPredictions : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ KnnModel : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ glmPred : num 0.274 0.065 0.267 0.118 0.245 ...
## $ attritionTest.AttritionYes: num 1 0 0 0 0 0 1 0 0 0 ...
We now break our final data frames into train and test sets with a \(0.7/0.3\) ratio. This will allow us to train and test our stacked decision trees. We split the data for both attrition and quality. Additionally, we apply the same split to our numerical quality variable for later evaluation.
ratio <- 0.7
set.seed(69)
trainRowsFinal <- sample(1:nrow(employeeModels), ratio*nrow(employeeModels))
employeeTrain <- employeeModels[trainRowsFinal, ]
employeeTest <- employeeModels[-trainRowsFinal, ]
attritionTrain <- attritionModels[trainRowsFinal,]
attritionTest <- attritionModels[-trainRowsFinal,]
quality <- quality[-trainRowsFinal,]
quality <- quality$Quality
Now we make our stacked decision tree for quality, using different costs for false positives and false negatives: \(1\) for false negatives and \(1.25\) for false positives (the default decision tree assigns a cost of \(1\) to both). In this way we penalize false positives more heavily, since we want to be confident that the workers we hire are truly high quality.
# FP cost 1.25, FN cost 1
cost_matrix <- matrix(c(0,1.25,1,0), nrow = 2)
finalDt <- C5.0(as.factor(employeeTest.Quality) ~., data = employeeTrain, costs = cost_matrix)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(finalDt)
employeepredict <- predict(finalDt, employeeTest)
confusionMatrix(as.factor(employeepredict), as.factor(employeeTest$employeeTest.Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 26
## 1 22 44
##
## Accuracy : 0.7828
## 95% CI : (0.7226, 0.8353)
## No Information Rate : 0.6833
## P-Value [Acc > NIR] : 0.0006775
##
## Kappa : 0.4904
##
## Mcnemar's Test P-Value : 0.6650055
##
## Sensitivity : 0.6286
## Specificity : 0.8543
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.8323
## Prevalence : 0.3167
## Detection Rate : 0.1991
## Detection Prevalence : 0.2986
## Balanced Accuracy : 0.7414
##
## 'Positive' Class : 1
##
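The "no dimnames" warning above is harmless but easy to silence: supply the factor levels as explicit dimnames so `C5.0` does not have to infer them. A minimal sketch (the row/column orientation convention is worth confirming against the C50 documentation before relying on it):

```r
# Same costs as above, but with explicit dimnames matching the class
# levels "0" and "1", which suppresses the C5.0 dimnames warning.
cost_matrix <- matrix(c(0, 1.25, 1, 0), nrow = 2,
                      dimnames = list(c("0", "1"), c("0", "1")))
```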
Our stacked model performed better than any of the base models, with a Kappa of \(0.49\). Our cost matrix also led to fewer false positives, meaning the workers we select are mostly quality workers.
It is paramount that we do not hire workers who will quickly quit. Therefore we assign a cost of \(5\) for false negatives and only \(1\) for false positives. This ensures that almost everyone we predict will not quit does not, in fact, quit.
# FP cost 1, FN cost 5
cost_matrix <- matrix(c(0,1,5,0), nrow = 2)
finalDt <- C5.0(as.factor(attritionTest.AttritionYes) ~., data = attritionTrain, costs = cost_matrix)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(finalDt)
attritionpredict <- predict(finalDt, attritionTest)
confusionMatrix(as.factor(attritionpredict), as.factor(attritionTest$attritionTest.AttritionYes), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 144 11
## 1 39 27
##
## Accuracy : 0.7738
## 95% CI : (0.7128, 0.8272)
## No Information Rate : 0.8281
## P-Value [Acc > NIR] : 0.9846951
##
## Kappa : 0.385
##
## Mcnemar's Test P-Value : 0.0001343
##
## Sensitivity : 0.7105
## Specificity : 0.7869
## Pos Pred Value : 0.4091
## Neg Pred Value : 0.9290
## Prevalence : 0.1719
## Detection Rate : 0.1222
## Detection Prevalence : 0.2986
## Balanced Accuracy : 0.7487
##
## 'Positive' Class : 1
##
We have a Kappa of about \(0.39\), significantly better than all of our base models, which shows the power of the cost matrix. Additionally, we have an extremely small number of false negatives (\(11\)), as intended. Our model has performed very well for our goal.
We make a plot to compare the performance of all our models. Notice the large improvement in attrition Kappa with our ultimate, stacked model.
library(ggplot2)
library(tidyr)
# Kappa values collected from the models above
Quality <- c(0.40, 0.40, 0.43, 0.48, 0.39, 0.49)
Attrition <- c(0.23, 0.19, 0.23, 0.20, 0.20, 0.39)
Models <- c("GLM", "KNN", "ANN", "SVM", "DT", "Ultimate")
data <- data.frame(Models, Quality, Attrition)
# Reshape the data
melted_data <- data %>%
  pivot_longer(cols = c(Quality, Attrition), names_to = "Variable", values_to = "Kappa")
# Plot
ggplot(melted_data, aes(x = Models, y = Kappa, fill = Variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), color = "darkgrey") +
  scale_fill_manual(values = c("black", "white")) +  # set fill colors manually
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
We decide to hire someone if we predict that they are both a quality worker and unlikely to quit; we call this "doHire." We make confusion matrices to see how many of our hires are quality and how many will eventually quit. We chose to hire only \(4\) workers who will eventually quit, less than \(15\%\) of our hires.
We also check how accurate our model is at choosing to hire: we take the actual quality and attrition values, apply the same hire criteria, and compare the result to our predictions. We did quite well, with \(18\) hires being good hires and only \(15\) being poor. Given all of the variables we had deleted, this is a strong performance. We have a final Kappa of \(0.32\).
doHire <- ifelse(employeepredict == 1 & attritionpredict == 0, 1, 0)
conf <- confusionMatrix(as.factor(doHire), as.factor(attritionTest$attritionTest.AttritionYes), positive = "1")
confusionMatrix(as.factor(doHire), as.factor(employeeTest$employeeTest.Quality), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 139 49
## 1 12 21
##
## Accuracy : 0.724
## 95% CI : (0.66, 0.7818)
## No Information Rate : 0.6833
## P-Value [Acc > NIR] : 0.1086
##
## Kappa : 0.257
##
## Mcnemar's Test P-Value : 4.04e-06
##
## Sensitivity : 0.30000
## Specificity : 0.92053
## Pos Pred Value : 0.63636
## Neg Pred Value : 0.73936
## Prevalence : 0.31674
## Detection Rate : 0.09502
## Detection Prevalence : 0.14932
## Balanced Accuracy : 0.61026
##
## 'Positive' Class : 1
##
shouldHire <- ifelse(employeeTest$employeeTest.Quality == 1 & attritionTest$attritionTest.AttritionYes == 0, 1, 0)
conf_matrix <- confusionMatrix(as.factor(doHire), as.factor(shouldHire), positive = "1")
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 157 31
## 1 15 18
##
## Accuracy : 0.7919
## 95% CI : (0.7323, 0.8434)
## No Information Rate : 0.7783
## P-Value [Acc > NIR] : 0.34751
##
## Kappa : 0.3172
##
## Mcnemar's Test P-Value : 0.02699
##
## Sensitivity : 0.36735
## Specificity : 0.91279
## Pos Pred Value : 0.54545
## Neg Pred Value : 0.83511
## Prevalence : 0.22172
## Detection Rate : 0.08145
## Detection Prevalence : 0.14932
## Balanced Accuracy : 0.64007
##
## 'Positive' Class : 1
##
conf_matrix_df <- as.data.frame(as.matrix(conf_matrix$table))
conf_df <- as.data.frame(as.matrix(conf$table))
ggplot(data = conf_matrix_df, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), colour = "black") +                   # black tile borders
  geom_text(aes(label = Freq), color = "black", family = "Arial") + # black Arial counts
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "darkgrey") +
  labs(x = "Ideal Hire Choice", y = "Model's Hire Choice",
       title = "Confusion Matrix of Hires") +
  theme(axis.text = element_text(size = 12, family = "Times New Roman"),
        axis.title = element_text(size = 14, face = "bold", family = "Times New Roman"),
        plot.title = element_text(size = 16, face = "bold", family = "Times New Roman"))
ggplot(data = conf_df, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), colour = "black") +                   # black tile borders
  geom_text(aes(label = Freq), color = "black", family = "Arial") + # black Arial counts
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "darkgrey") +
  labs(x = "Hire would Quit", y = "Recommended Hire",
       title = "Recommended Hire Compared to Attrition") +
  theme(axis.text = element_text(size = 12, family = "Times New Roman"),
        axis.title = element_text(size = 14, face = "bold", family = "Times New Roman"),
        plot.title = element_text(size = 16, face = "bold", family = "Times New Roman"))
We kept our numerical quality from earlier for a reason: we want to evaluate our total final profit. First, we look at a histogram of the quality of our recommended hires. Notice that even those below \(50\), which we rate as "Not Quality," are still above \(0\) or not far below it. This is a very good sign. Finally, we remove the \(4\) workers who will quit from our data set and sum up the quality of the remaining hires. This gives us our final increase in dollars per day: a net profit of about \(2,500\) dollars, which means \[\frac{2,500}{33} \approx 75 \text{ dollars per hire}\]
qualityOfHires <- data.frame(quality, doHire, attritionTest$attritionTest.AttritionYes)
subset <- qualityOfHires[qualityOfHires$doHire == 1,]
hist(subset$quality,
main = "Employee Quality of Recommended Hires",
col = 'black',
border = 'white',
xlab = "$ Added per Day",
breaks = seq(-50, 250, by = 25))
subset <- subset[subset$attritionTest.attritionTest.AttritionYes == 0,]
sum(subset$quality)
## [1] 2518.822
Our model was largely successful and powerful at predicting whom we should hire based on our metrics. This is made evident by the average profit per day per worker of \(75\) dollars. This number rests on our arbitrary assumptions, but it is such a massive increase from the initial \(-16\) dollars that its power cannot be denied. At any rate, we are increasing variables that are certainly correlated with improved job performance. Its power is meaningful because we initially deleted any variables highly correlated with the variables we were predicting; the only variables remaining were those giving basic information on a candidate, such as education and distance from home. This ensures that our model can actually be used when making hires.
With the power of the cost matrix, we were able to almost entirely avoid workers who will eventually quit, which leads to much greater long-term profit. Our recommended hires are both high quality and likely to be loyal; this is what produced the roughly \(75\) extra dollars per day per worker among our recommended hires.
With that said, our model was not perfect: it did recommend hiring some individuals who would not lead to increased profits. We therefore recommend that the model be used in conjunction with other hiring methods, such as typical resume screens or interviews. With this holistic approach, IBM can make the most profitable hires possible.
Expanding on that previous note, our current model assumes it is the sole method for hiring. The results are good, but they can be improved upon with a holistic approach: using our model to screen candidates for interviews rather than to make the final decision. In that case we would change the cost matrices entirely. We currently have \((0, 1.25, 1, 0)\) for quality; if we were only screening, we might relax this to \((0,1,1,0)\). Likewise, we might relax our attrition cost matrix from \((0,1,5,0)\) to \((0,1,2,0)\).
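Those hypothetical screening-mode matrices would be a two-line change to the code above. The values below come from the discussion in the text; the fits themselves were not run, so the results are untested:

```r
# Hypothetical screening-mode cost matrices: near-neutral costs for quality,
# and a milder false-negative penalty for attrition, since interviews would
# catch some of the errors the model lets through.
quality_costs   <- matrix(c(0, 1, 1, 0), nrow = 2)
attrition_costs <- matrix(c(0, 1, 2, 0), nrow = 2)
```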